Recent years have seen growing interest in 3D face modeling owing to its broad applications in digital avatars, character generation, and animation. Existing methods overwhelmingly emphasize modeling the exterior shape, texture, and skin properties of the face, while ignoring the inherent correlation between the inner skeletal structure and the outward appearance. In this paper, we present SCULPTOR, skeleton-consistent 3D face creation using a learned parametric facial generator, which aims to ease the creation of anatomically correct and visually convincing face models through a hybrid parametric-morphable representation. At the core of SCULPTOR is LUCY, the first large-scale shape-skeleton face dataset, built in collaboration with plastic surgeons. Named after the fossil of one of the oldest known human ancestors, our LUCY dataset contains high-quality computed tomography (CT) scans of complete human heads before and after orthognathic surgery, which are critical for evaluating surgical outcomes. LUCY consists of 144 scans of 72 subjects (31 male and 41 female), where each subject received two CT scans, one before and one after orthognathic surgery. Based on the LUCY dataset, we learn a novel skeleton-consistent parametric facial generator, SCULPTOR, which can create the unique and nuanced facial features that help define a character while remaining physiologically sound. By decomposing a 3D face into shape blendshapes, pose blendshapes, and facial expression blendshapes, SCULPTOR jointly models the skull, facial geometry, and facial appearance under a unified data-driven framework. Compared with existing methods, SCULPTOR preserves both anatomical correctness and visual realism in face generation tasks. Finally, we demonstrate the robustness and effectiveness of SCULPTOR in a variety of previously unseen applications.
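The additive blendshape decomposition described above can be sketched as follows; the dimensions, bases, and variable names here are hypothetical placeholders, not SCULPTOR's actual parameterization.

```python
import numpy as np

# Illustrative additive blendshape model (hypothetical dimensions): a mesh is
# a template plus shape, pose, and expression offsets driven by low-dimensional codes.
N_VERTS = 5000                      # number of mesh vertices (assumed)
N_SHAPE, N_POSE, N_EXPR = 100, 12, 50

rng = np.random.default_rng(0)
template = rng.standard_normal((N_VERTS, 3))            # mean face/skull geometry
B_shape = rng.standard_normal((N_SHAPE, N_VERTS, 3))    # shape blendshape basis
B_pose = rng.standard_normal((N_POSE, N_VERTS, 3))      # pose-corrective basis
B_expr = rng.standard_normal((N_EXPR, N_VERTS, 3))      # expression basis

def generate_face(beta, theta, psi):
    """Return vertices = template + sum of weighted blendshape offsets."""
    offset = (np.tensordot(beta, B_shape, axes=1)
              + np.tensordot(theta, B_pose, axes=1)
              + np.tensordot(psi, B_expr, axes=1))
    return template + offset

verts = generate_face(rng.standard_normal(N_SHAPE) * 0.1,
                      np.zeros(N_POSE),
                      np.zeros(N_EXPR))
print(verts.shape)  # (5000, 3)
```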
Production-level workflows for generating convincing 3D dynamic human faces have long relied on a variety of labor-intensive tools for geometry and texture generation, motion capture and rigging, and expression synthesis. Recent neural approaches can automate individual components, but the corresponding latent representations do not give artists the explicit control that conventional tools provide. In this paper, we present a new learning-based, video-driven approach for generating dynamic facial geometry with high-quality physically-based assets. For data collection, we build a hybrid multi-view photometric capture stage coupled with an ultra-fast video camera to obtain raw 3D facial assets. We then model facial expression, geometry, and physically-based textures with separate VAEs, and impose a global MLP-based expression mapping across the latent spaces of the individual networks to preserve the characteristics of each attribute. We also model the delta information as wrinkle maps for the physically-based textures, achieving high-quality 4K dynamic textures. We demonstrate our approach on high-fidelity performer-specific facial capture and cross-identity facial motion retargeting. In addition, our multi-VAE-based neural assets, together with a fast adaptation scheme, can also be deployed to handle in-the-wild videos. Moreover, we motivate the utility of our explicit facial disentanglement strategy by presenting a variety of promising physically-based editing results with high realism. Comprehensive experiments show that our technique offers higher accuracy and visual fidelity than previous video-driven facial reconstruction and animation methods.
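As a minimal sketch of the multi-VAE idea described above (with assumed dimensions and toy networks, not the paper's architecture), separate VAEs can model different facial attributes while a global MLP maps one latent space into another:

```python
import torch
import torch.nn as nn

# Toy sketch: two attribute-specific VAEs plus a global MLP that maps an
# expression latent into the geometry latent space, so one code can drive both.
class TinyVAE(nn.Module):
    def __init__(self, in_dim, latent_dim):
        super().__init__()
        self.enc = nn.Linear(in_dim, 2 * latent_dim)   # outputs mean and log-variance
        self.dec = nn.Linear(latent_dim, in_dim)

    def encode(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization
        return z, mu, logvar

    def decode(self, z):
        return self.dec(z)

expr_vae = TinyVAE(in_dim=64, latent_dim=16)          # facial expression branch
geom_vae = TinyVAE(in_dim=3 * 5000, latent_dim=64)    # facial geometry branch

# Global MLP mapping expression latents to geometry latents.
expr_to_geom = nn.Sequential(nn.Linear(16, 128), nn.ReLU(), nn.Linear(128, 64))

expr_input = torch.randn(8, 64)                        # batch of expression features
z_expr, _, _ = expr_vae.encode(expr_input)
geom_pred = geom_vae.decode(expr_to_geom(z_expr))      # geometry driven by expression
print(geom_pred.shape)  # torch.Size([8, 15000])
```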
Deep neural networks (DNNs) are known to be vulnerable to adversarial attacks: imperceptible perturbations of the input can mislead DNNs trained on clean images into making wrong predictions. To address this, adversarial training, which augments the training set with adversarial samples generated on the fly, is currently the most effective defense. Interestingly, we discover for the first time that randomly initialized networks, without any model training, contain subnetworks with inborn robustness whose robust accuracy matches or surpasses that of adversarially trained networks, indicating that adversarial training of model weights is not indispensable for adversarial robustness. We name such subnetworks Robust Scratch Tickets (RSTs), which are also by nature efficient. Unlike the popular lottery ticket hypothesis, neither the original dense network nor the identified RSTs need to be trained. To validate and understand this intriguing finding, we further conduct extensive experiments studying the existence and properties of RSTs under different models, datasets, sparsity patterns, and attacks, drawing insights into the relationship between the robustness of DNNs and their initialization and overparameterization. Furthermore, we identify the poor adversarial transferability between RSTs of different sparsity ratios drawn from the same randomly initialized dense network, and propose a technique that randomly switches between different RSTs as a novel defense method built on top of RSTs for the first time. We believe our findings about RSTs open up a new perspective for studying model robustness and extend the lottery ticket hypothesis.
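The defense sketched below only illustrates the idea of randomly switching between sparse subnetworks of a frozen, randomly initialized network; the masks here are sampled at random rather than searched for robustness as in the paper.

```python
import torch
import torch.nn as nn

# Keep a randomly initialized layer with frozen weights, hold several binary
# masks that each select a sparse subnetwork ("ticket"), and randomly switch
# between masks at inference so no single fixed subnetwork can be targeted.
torch.manual_seed(0)

class MaskedLinear(nn.Module):
    def __init__(self, in_f, out_f):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_f, in_f), requires_grad=False)

    def forward(self, x, mask):
        return x @ (self.weight * mask).t()

layer = MaskedLinear(784, 10)

def random_mask(shape, sparsity):
    """Binary mask keeping a (1 - sparsity) fraction of randomly chosen weights."""
    return (torch.rand(shape) > sparsity).float()

# A pool of candidate subnetwork masks (in the paper these would be searched
# for robustness rather than sampled at random).
masks = [random_mask(layer.weight.shape, sparsity=0.8) for _ in range(4)]

def predict(x):
    mask = masks[torch.randint(len(masks), (1,)).item()]  # random switch per call
    return layer(x, mask)

logits = predict(torch.randn(2, 784))
print(logits.shape)  # torch.Size([2, 10])
```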
Human modeling and relighting are two fundamental problems in computer vision and graphics, where high-quality datasets can largely facilitate related research. However, most existing human datasets only provide multi-view human images captured under the same illumination. Although valuable for modeling tasks, they are not readily used in relighting problems. To promote research in both fields, in this paper, we present UltraStage, a new 3D human dataset that contains more than 2K high-quality human assets captured under both multi-view and multi-illumination settings. Specifically, for each example, we provide 32 surrounding views illuminated with one white light and two gradient illuminations. In addition to regular multi-view images, gradient illuminations help recover detailed surface normals and spatially-varying material maps, enabling various relighting applications. Inspired by recent advances in neural representation, we further interpret each example into a neural human asset which allows novel view synthesis under arbitrary lighting conditions. We show our neural human assets can achieve extremely high capture performance and are capable of representing fine details such as facial wrinkles and cloth folds. We also validate UltraStage in single image relighting tasks, training neural networks with virtual relighted data from neural assets and demonstrating realistic rendering improvements over prior art. UltraStage will be publicly available to the community to stimulate significant future developments in various human modeling and rendering tasks.
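One classic way gradient illuminations enable normal recovery is the Lambertian ratio trick: under a linear gradient light along an axis, the ratio of the gradient-lit image to the uniformly lit image encodes that normal component. The sketch below uses three axis-aligned gradients for clarity and is only an idealized illustration, not the dataset's actual processing pipeline.

```python
import numpy as np

# Idealized Lambertian assumption: under a linear gradient illumination along
# axis k, (gradient-lit image) / (uniformly lit image) ~= (n_k + 1) / 2,
# so each normal component can be read off per pixel.
def normals_from_gradients(full, grad_x, grad_y, grad_z, eps=1e-6):
    """full, grad_*: HxW grayscale images; returns HxWx3 unit normals."""
    nx = 2.0 * grad_x / (full + eps) - 1.0
    ny = 2.0 * grad_y / (full + eps) - 1.0
    nz = 2.0 * grad_z / (full + eps) - 1.0
    n = np.stack([nx, ny, nz], axis=-1)
    return n / (np.linalg.norm(n, axis=-1, keepdims=True) + eps)

# Synthetic check: every pixel has true normal (0, 0, 1) and albedo 0.7.
albedo = 0.7
full = np.full((2, 2), albedo)
gx = full * (0.0 + 1.0) / 2.0
gy = full * (0.0 + 1.0) / 2.0
gz = full * (1.0 + 1.0) / 2.0
print(normals_from_gradients(full, gx, gy, gz)[0, 0])  # ~[0, 0, 1]
```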
High-quality estimates of uncertainty and robustness are crucial for numerous real-world applications, especially for deep learning, which underlies many deployed ML systems. The ability to compare techniques that improve these estimates is therefore very important for research and practice alike. Yet competitive comparisons of methods are often lacking, for a range of reasons including the availability of compute for extensive tuning, the inclusion of sufficiently many baselines, and concrete documentation for reproducibility. In this paper, we introduce Uncertainty Baselines: high-quality implementations of standard and state-of-the-art deep learning methods on a variety of tasks. As of this writing, the collection spans 9 methods, each with at least 5 metrics. Each baseline is a self-contained experiment pipeline with easily reusable and extendable components. Our goal is to provide immediate starting points for experimentation with new methods or applications. In addition, we provide model checkpoints, experiment outputs as Python notebooks, and leaderboards for comparing results. Code is available at https://github.com/google/uncertainty-baselines.
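As an example of the kind of uncertainty metric such baselines report, the sketch below computes expected calibration error (ECE) in plain NumPy; it is an illustrative implementation, not code from the repository.

```python
import numpy as np

# Expected calibration error: bin predictions by confidence and accumulate the
# weighted gap between average confidence and empirical accuracy per bin.
def expected_calibration_error(confidences, correct, n_bins=15):
    """confidences: max predicted probability per example; correct: 0/1 array."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            acc = correct[in_bin].mean()          # empirical accuracy in the bin
            conf = confidences[in_bin].mean()     # average confidence in the bin
            ece += in_bin.mean() * abs(acc - conf)
    return ece

rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, size=1000)
correct = (rng.uniform(size=1000) < conf).astype(float)  # roughly calibrated model
print(round(expected_calibration_error(conf, correct), 3))
```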
Unsupervised domain adaptation (UDA) for semantic segmentation is a promising task freeing people from heavy annotation work. However, domain discrepancies in low-level image statistics and high-level contexts compromise the segmentation performance over the target domain. A key idea to tackle this problem is to perform both image-level and feature-level adaptation jointly. Unfortunately, there is a lack of such unified approaches for UDA tasks in the existing literature. This paper proposes a novel UDA pipeline for semantic segmentation that unifies image-level and feature-level adaptation. Concretely, for image-level domain shifts, we propose a global photometric alignment module and a global texture alignment module that align images in the source and target domains in terms of image-level properties. For feature-level domain shifts, we perform global manifold alignment by projecting pixel features from both domains onto the feature manifold of the source domain; and we further regularize category centers in the source domain through a category-oriented triplet loss and perform target domain consistency regularization over augmented target domain images. Experimental results demonstrate that our pipeline significantly outperforms previous methods. In the commonly tested GTA5$\rightarrow$Cityscapes task, our proposed method using Deeplab V3+ as the backbone surpasses previous SOTA by 8%, achieving 58.2% in mIoU.
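A simple stand-in for image-level photometric alignment is per-channel mean/std matching of a source image toward a target-domain reference; the paper's global module is learned and more involved, so the sketch below only conveys the idea.

```python
import numpy as np

# Shift a source-domain image's per-channel statistics toward a target-domain
# reference, reducing low-level photometric discrepancy between domains.
def photometric_align(src, ref, eps=1e-6):
    """src, ref: HxWx3 float images in [0, 1]; returns src matched to ref's statistics."""
    out = np.empty_like(src)
    for c in range(3):
        s_mu, s_std = src[..., c].mean(), src[..., c].std()
        r_mu, r_std = ref[..., c].mean(), ref[..., c].std()
        out[..., c] = (src[..., c] - s_mu) / (s_std + eps) * r_std + r_mu
    return np.clip(out, 0.0, 1.0)

rng = np.random.default_rng(0)
source_img = rng.uniform(0.2, 0.6, size=(64, 64, 3))   # e.g. a GTA5 frame
target_img = rng.uniform(0.4, 0.9, size=(64, 64, 3))   # e.g. a Cityscapes frame
aligned = photometric_align(source_img, target_img)
print(aligned.mean(axis=(0, 1)))  # channel means now close to the target's
```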
Different people speak with diverse personalized speaking styles. Although existing one-shot talking head methods have made significant progress in lip sync, natural facial expressions, and stable head motions, they still cannot generate diverse speaking styles in the final talking head videos. To tackle this problem, we propose a one-shot style-controllable talking face generation framework. In a nutshell, we aim to attain a speaking style from an arbitrary reference speaking video and then drive the one-shot portrait to speak with the reference speaking style and another piece of audio. Specifically, we first develop a style encoder to extract dynamic facial motion patterns of a style reference video and then encode them into a style code. Afterward, we introduce a style-controllable decoder to synthesize stylized facial animations from the speech content and style code. In order to integrate the reference speaking style into generated videos, we design a style-aware adaptive transformer, which enables the encoded style code to adjust the weights of the feed-forward layers accordingly. Thanks to the style-aware adaptation mechanism, the reference speaking style can be better embedded into synthesized videos during decoding. Extensive experiments demonstrate that our method is capable of generating talking head videos with diverse speaking styles from only one portrait image and an audio clip while achieving authentic visual effects. Project Page: https://github.com/FuxiVirtualHuman/styletalk.
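The sketch below illustrates one plausible form of style-aware adaptation (assumed dimensions, not the paper's exact design): a style code predicts per-channel scales that modulate a feed-forward block of the decoder.

```python
import torch
import torch.nn as nn

# A feed-forward block whose hidden units are modulated by scales predicted
# from a style code, so decoding adapts to the reference speaking style.
class StyleAwareFeedForward(nn.Module):
    def __init__(self, dim=256, hidden=1024, style_dim=128):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden)
        self.fc2 = nn.Linear(hidden, dim)
        self.to_scale = nn.Linear(style_dim, hidden)  # style -> channel scales

    def forward(self, x, style_code):
        scale = torch.sigmoid(self.to_scale(style_code))   # (B, hidden) in (0, 1)
        h = torch.relu(self.fc1(x)) * scale.unsqueeze(1)   # modulate hidden units
        return x + self.fc2(h)                             # residual connection

block = StyleAwareFeedForward()
content = torch.randn(2, 50, 256)   # audio/content features per frame
style = torch.randn(2, 128)         # style code from the reference video
print(block(content, style).shape)  # torch.Size([2, 50, 256])
```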
Witnessing the impressive achievements of pre-training techniques on large-scale data in the field of computer vision and natural language processing, we wonder whether this idea could be adapted in a grab-and-go spirit, and mitigate the sample inefficiency problem for visuomotor driving. Given the highly dynamic and variant nature of the input, the visuomotor driving task inherently lacks view and translation invariance, and the visual input contains massive irrelevant information for decision making, making predominant pre-training approaches from general vision less suitable for the autonomous driving task. To this end, we propose PPGeo (Policy Pre-training via Geometric modeling), an intuitive and straightforward fully self-supervised framework curated for the policy pretraining in visuomotor driving. We aim at learning policy representations as a powerful abstraction by modeling 3D geometric scenes on large-scale unlabeled and uncalibrated YouTube driving videos. The proposed PPGeo is performed in two stages to support effective self-supervised training. In the first stage, the geometric modeling framework generates pose and depth predictions simultaneously, with two consecutive frames as input. In the second stage, the visual encoder learns driving policy representation by predicting the future ego-motion and optimizing with the photometric error based on current visual observation only. As such, the pre-trained visual encoder is equipped with rich driving policy related representations and thereby competent for multiple visuomotor driving tasks. Extensive experiments covering a wide span of challenging scenarios have demonstrated the superiority of our proposed approach, where improvements range from 2% to even over 100% with very limited data. Code and models will be available at https://github.com/OpenDriveLab/PPGeo.
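The core self-supervised signal in such geometric pre-training is the monocular photometric reconstruction error between a frame and a neighboring frame warped by predicted depth and relative pose. The sketch below assumes known intrinsics and toy shapes, and omits the SSIM term and masking typically used in practice.

```python
import torch
import torch.nn.functional as F

# Warp a source frame into the target view using predicted depth and relative
# pose, then penalize the photometric difference against the target frame.
def warp_frame(src, depth, pose, K):
    """src: (B,3,H,W); depth: (B,1,H,W) for the target view; pose: (B,4,4); K: (B,3,3)."""
    B, _, H, W = src.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], 0).float()          # (3,H,W)
    cam = torch.einsum("bij,jhw->bihw", K.inverse(), pix) * depth        # back-project
    cam_h = torch.cat([cam, torch.ones(B, 1, H, W)], 1)                  # homogeneous
    proj = torch.einsum("bij,bjhw->bihw", (K @ pose[:, :3]), cam_h)      # to source view
    uv = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)
    grid = torch.stack([2 * uv[:, 0] / (W - 1) - 1, 2 * uv[:, 1] / (H - 1) - 1], -1)
    return F.grid_sample(src, grid, align_corners=True)

def photometric_loss(target, src, depth, pose, K):
    return (target - warp_frame(src, depth, pose, K)).abs().mean()       # L1 error

B, H, W = 1, 32, 48
K = torch.tensor([[[30.0, 0, W / 2], [0, 30.0, H / 2], [0, 0, 1.0]]])
identity_pose = torch.eye(4).unsqueeze(0)
frame_t, frame_s = torch.rand(B, 3, H, W), torch.rand(B, 3, H, W)
print(photometric_loss(frame_t, frame_s, torch.ones(B, 1, H, W), identity_pose, K))
```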
Increasing research interest focuses on sequential recommender systems, which aim to model dynamic sequence representations precisely. However, the most commonly used loss functions in state-of-the-art sequential recommendation models have essential limitations. To name a few, Bayesian Personalized Ranking (BPR) loss suffers from the vanishing gradient problem caused by extensive negative sampling and prediction biases; Binary Cross-Entropy (BCE) loss is sensitive to the number of negative samples, and is therefore likely to ignore valuable negative examples and reduce training efficiency; Cross-Entropy (CE) loss only focuses on the last timestamp of the training sequence, which causes low utilization of sequence information and results in inferior user sequence representations. To avoid these limitations, in this paper we propose to calculate a Cumulative Cross-Entropy (CCE) loss over the sequence. CCE is simple and direct, enjoying the virtues of painless deployment, no negative sampling, and effective and efficient training. We conduct extensive experiments on five benchmark datasets to demonstrate the effectiveness and efficiency of CCE. The results show that employing the CCE loss on three state-of-the-art models, GRU4Rec, SASRec, and S3-Rec, yields 125.63%, 69.90%, and 33.24% average improvement in full-ranking NDCG@5, respectively. Using CCE, the performance curve of the models on the test data rises rapidly with wall-clock time and is superior to that of other loss functions throughout almost the whole process of model training.
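Read this way, CCE amounts to computing the standard cross-entropy at every position of the training sequence and averaging over valid steps, rather than only at the last timestamp; below is a minimal PyTorch sketch under that reading (shapes are illustrative).

```python
import torch
import torch.nn.functional as F

# Cross-entropy accumulated over all sequence positions instead of only the
# last timestamp; padded positions are ignored.
def cumulative_cross_entropy(logits, targets, pad_id=0):
    """logits: (B, T, num_items); targets: (B, T) next-item ids, pad_id ignored."""
    B, T, V = logits.shape
    loss = F.cross_entropy(
        logits.reshape(B * T, V), targets.reshape(B * T),
        ignore_index=pad_id, reduction="sum",
    )
    return loss / (targets != pad_id).sum().clamp(min=1)   # mean over valid steps

logits = torch.randn(4, 10, 1000)            # e.g. SASRec outputs over all items
targets = torch.randint(1, 1000, (4, 10))    # ground-truth next item at each step
print(cumulative_cross_entropy(logits, targets))
```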
Recent advances in self-supervised learning (SSL) in computer vision are primarily comparative, with the goal of preserving invariant and discriminative semantics in latent representations by comparing siamese image views. However, the preserved high-level semantics do not contain enough local information, which is vital in medical image analysis (e.g., image-based diagnosis and tumor segmentation). To mitigate the locality problem of comparative SSL, we propose to incorporate the task of pixel restoration for explicitly encoding more pixel-level information into high-level semantics. We also address the preservation of scale information, a powerful tool for aiding image understanding that has not drawn much attention in SSL. The resulting framework can be formulated as a multi-task optimization problem on the feature pyramid. Specifically, we conduct multi-scale pixel restoration and siamese feature comparison in the pyramid. In addition, we propose non-skip U-Net to build the feature pyramid and develop sub-crop to replace multi-crop in 3D medical imaging. The proposed unified SSL framework (PCRLv2) surpasses its self-supervised counterparts on various tasks, including brain tumor segmentation (BraTS 2018), chest pathology identification (ChestX-ray, CheXpert), pulmonary nodule detection (LUNA), and abdominal organ segmentation (LiTS), sometimes outperforming them by large margins with limited annotations.
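The two objectives can be combined as sketched below (assumed shapes, not the actual PCRLv2 implementation): a pixel-restoration loss evaluated at several pyramid scales plus a cosine-similarity comparison between siamese embeddings of two views.

```python
import torch
import torch.nn.functional as F

# Multi-task objective: restore pixels at several pyramid scales and pull the
# siamese embeddings of two augmented views of the same scan together.
def multi_scale_restoration_loss(reconstructions, target):
    """reconstructions: list of (B, C, H_s, W_s) decoder outputs at pyramid scales."""
    loss = 0.0
    for rec in reconstructions:
        scaled = F.interpolate(target, size=rec.shape[-2:], mode="bilinear",
                               align_corners=False)
        loss = loss + F.mse_loss(rec, scaled)
    return loss / len(reconstructions)

def siamese_comparison_loss(feat_a, feat_b):
    """Encourage high cosine similarity between features of the two views."""
    return 1.0 - F.cosine_similarity(feat_a, feat_b, dim=-1).mean()

target = torch.rand(2, 1, 64, 64)                              # original image patch
recs = [torch.rand(2, 1, 64, 64), torch.rand(2, 1, 32, 32)]    # two pyramid levels
za, zb = torch.randn(2, 128), torch.randn(2, 128)              # siamese embeddings
total = multi_scale_restoration_loss(recs, target) + siamese_comparison_loss(za, zb)
print(total)
```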